1. Introduction.

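Let x be an n-dimensional random variable whose density function p
is a convex combination of normal densities, i.e.,

$$p(x) = \sum_{i=1}^{m} \alpha_i^o\, p_i(x),$$

where

$$\alpha_i^o > 0, \qquad \sum_{i=1}^{m} \alpha_i^o = 1,$$

and

$$p_i(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i^o|^{1/2}}
  \exp\!\left[-\tfrac{1}{2}\,(x-\mu_i^o)^T (\Sigma_i^o)^{-1} (x-\mu_i^o)\right],
  \qquad i = 1,\dots,m.$$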
If $\{x_k\}_{k=1,\dots,N} \subset \mathbb{R}^n$ is an independent sample of observations on x, then
a maximum-likelihood estimate of the parameters $\{\alpha_i^o, \mu_i^o, \Sigma_i^o\}_{i=1,\dots,m}$
is a choice of parameters $\{\alpha_i, \mu_i, \Sigma_i\}_{i=1,\dots,m}$ which locally maximizes the
log-likelihood function

$$L = \sum_{k=1}^{N} \log p(x_k),$$

in which p is evaluated with the true parameters $\{\alpha_i^o, \mu_i^o, \Sigma_i^o\}_{i=1,\dots,m}$
replaced by the estimate $\{\alpha_i, \mu_i, \Sigma_i\}_{i=1,\dots,m}$. (In the following, it is
usually clear from the context which parameters are used in evaluating the
density functions $p_i$ and p. Therefore, these parameters are explicitly
pointed out only when some ambiguity exists.)

Clearly, L is a differentiable function of the parameters to be estimated.
Equating to zero the partial derivatives of L with respect to these parameters,
one obtains, after a straightforward calculation, the following necessary
conditions for a maximum-likelihood estimate:


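$$\alpha_i = \frac{\alpha_i}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \tag{1.a}$$

$$\mu_i = \left\{ \frac{1}{N} \sum_{k=1}^{N} x_k\, \frac{p_i(x_k)}{p(x_k)} \right\}
  \bigg/ \left\{ \frac{1}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \right\} \tag{1.b}$$

$$\Sigma_i = \left\{ \frac{1}{N} \sum_{k=1}^{N} (x_k-\mu_i)(x_k-\mu_i)^T\, \frac{p_i(x_k)}{p(x_k)} \right\}
  \bigg/ \left\{ \frac{1}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \right\} \tag{1.c}$$

for i = 1,...,m.
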
These are known as the likelihood equations. As observed by Cramer,
Huzurbazar [7], Wald [11], Chanda [1], and others, there is, loosely
speaking, a unique solution of the likelihood equations which tends in
probability to the true parameters as the sample size N approaches infinity.
Furthermore, this solution is a maximum-likelihood estimate, indeed, the
unique consistent maximum-likelihood estimate. (Strictly speaking, given any
sufficiently small neighborhood of the true parameters, there is, with probability
tending to 1 as N approaches infinity, a unique solution of the likelihood
equations in that neighborhood, and this solution is a maximum-likelihood estimate.
For completeness, we present a brief proof of this result in an appendix.)

This note is addressed to the problem of determining this consistent maximum- 
likelihood estimate by successive approximations. 

The likelihood equations, as written, suggest the following iterative
procedure for obtaining a solution: Beginning with some set of starting values,
obtain successive approximations to a solution by inserting the preceding
approximations in the expressions on the right-hand sides of (1.a), (1.b),
and (1.c). This scheme is attractive for its relative ease of implementation,
and we discuss below the findings of several authors concerning its use in
obtaining maximum-likelihood estimates. For a discussion of other methods of
determining maximum-likelihood estimates, see Kale [8] and Wolfe [13] as
well as the authors given below.

Empirical studies of Day [3], Duda and Hart [4], and Hasselblad [5]
suggest that this scheme is convergent and that convergence is particularly
fast when the component normal densities in p are "widely separated" in a
certain sense. Unfortunately, the likelihood equations have many solutions
in general, and the iterates may converge to solutions, including "singular
solutions" (see [4]), which are not the consistent maximum-likelihood
estimate if care is not taken in the choice of starting values. No theoretical
evidence of convergence is given in [3], [4], or [5].

Peters and Coberly [10] have proved that, if all of the parameters $\mu_i$
and $\Sigma_i$ are held fixed, then the iterative procedure suggested by the equation
(1.a) alone converges locally to a maximum-likelihood estimate of the
parameters $\alpha_i$, i = 1,...,m. (An iterative procedure is said to converge locally
to a limit if the iterates converge to that limit whenever the starting values
are sufficiently near that limit.) They also report on numerical studies in
which the computational feasibility of this procedure is demonstrated. Walker
[12] has shown that, if all the parameters $\alpha_i$ and $\Sigma_i$ are held fixed, then
the iterative procedure suggested by the equation (1.b) converges locally to
a maximum-likelihood estimate of the means $\mu_i$, i = 1,...,m, provided that
either m = 2 or the component normal densities in p are "widely separated"
in a certain sense.

In the following, we present a general iterative procedure for determining
the consistent maximum-likelihood estimate, of which the above procedure is a
special case. Indeed, our procedure is in some ways like a steepest-ascent
method, and the above procedure is obtained when a certain "step-size" is
taken to be 1. We show that, if the "step-size" is sufficiently small, then
with probability approaching 1 as the sample size approaches infinity, this
procedure converges locally to the consistent maximum-likelihood estimate. This
scheme is as easily implemented in general as in the above special case, and
it appears to hold considerable promise as an effective tool for obtaining
consistent maximum-likelihood estimates in many situations of practical interest.





2. The general iterative procedure.

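In order to minimize notational difficulties, we introduce several vector
spaces and give useful representations of their elements. For each i,
1 ≤ i ≤ m, let $\alpha_i$, $\mu_i$, and $\Sigma_i$ denote elements of the vector spaces $\mathbb{R}^1$, $\mathbb{R}^n$,
and the set of all real, symmetric n×n matrices, respectively. We denote
by $\mathcal{A}$, $\mathcal{M}$, and $\mathcal{S}$ the respective m-fold direct sums of these spaces with
themselves, and we represent elements of $\mathcal{A}$, $\mathcal{M}$, and $\mathcal{S}$ as columns

$$\alpha = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_m \end{pmatrix}, \qquad
  \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_m \end{pmatrix}, \qquad
  \Sigma = \begin{pmatrix} \Sigma_1 \\ \vdots \\ \Sigma_m \end{pmatrix}.$$

It will be convenient to represent elements of the direct sum $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$
either as

$$\begin{pmatrix} \alpha \\ \mu \\ \Sigma \end{pmatrix}
  \qquad\text{or}\qquad
  \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_m \\ \mu_1 \\ \vdots \\ \mu_m \\ \Sigma_1 \\ \vdots \\ \Sigma_m \end{pmatrix}.$$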

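If, for i = 1,...,m, we denote

$$A_i(\alpha,\mu,\Sigma) = \frac{\alpha_i}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)},$$

$$M_i(\alpha,\mu,\Sigma) = \left\{ \frac{1}{N} \sum_{k=1}^{N} x_k\, \frac{p_i(x_k)}{p(x_k)} \right\}
  \bigg/ \left\{ \frac{1}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \right\},$$

$$S_i(\alpha,\mu,\Sigma) = \left\{ \frac{1}{N} \sum_{k=1}^{N} (x_k-\mu_i)(x_k-\mu_i)^T\, \frac{p_i(x_k)}{p(x_k)} \right\}
  \bigg/ \left\{ \frac{1}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \right\},$$

then the likelihood equations can be written as

$$\begin{pmatrix} \alpha \\ \mu \\ \Sigma \end{pmatrix} =
  \begin{pmatrix} A(\alpha,\mu,\Sigma) \\ M(\alpha,\mu,\Sigma) \\ S(\alpha,\mu,\Sigma) \end{pmatrix},$$

where A, M, and S denote the columns with entries $A_i$, $M_i$, and $S_i$,
respectively. For a positive number $\varepsilon$, let $\Phi_\varepsilon$ denote the operator on
$\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$ given by

$$\Phi_\varepsilon(\alpha,\mu,\Sigma) = (1-\varepsilon) \begin{pmatrix} \alpha \\ \mu \\ \Sigma \end{pmatrix}
  + \varepsilon \begin{pmatrix} A(\alpha,\mu,\Sigma) \\ M(\alpha,\mu,\Sigma) \\ S(\alpha,\mu,\Sigma) \end{pmatrix}.$$

We consider the iterative procedure

$$\begin{pmatrix} \alpha^{(k+1)} \\ \mu^{(k+1)} \\ \Sigma^{(k+1)} \end{pmatrix}
  = \Phi_\varepsilon\big(\alpha^{(k)},\mu^{(k)},\Sigma^{(k)}\big) \tag{4}$$

for k = 1,2,3,.... This procedure becomes the procedure given in the
introduction when $\varepsilon$ = 1.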
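As an illustration only, the relaxed iteration can be sketched for the univariate case (n = 1). The function names `normal_pdf` and `relaxed_step`, the sample data, and the starting values below are hypothetical and not taken from the paper; the sketch also assumes that the covariance update is centred at the current mean, which is how (1.c) reads here. Setting `eps = 1.0` should reproduce plain successive substitution in (1.a)-(1.c).

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def relaxed_step(xs, alphas, means, variances, eps):
    """Return (1 - eps) * current + eps * (A, M, S) for a univariate mixture."""
    n_obs, m = len(xs), len(alphas)
    # ratios[k][i] = p_i(x_k) / p(x_k), evaluated at the current parameters.
    ratios = []
    for x in xs:
        comp = [normal_pdf(x, means[i], variances[i]) for i in range(m)]
        mix = sum(a * c for a, c in zip(alphas, comp))
        ratios.append([c / mix for c in comp])

    new_alphas, new_means, new_vars = [], [], []
    for i in range(m):
        w = sum(r[i] for r in ratios) / n_obs  # (1/N) * sum_k p_i(x_k)/p(x_k)
        a_i = alphas[i] * w  # A_i
        m_i = sum(x * r[i] for x, r in zip(xs, ratios)) / n_obs / w  # M_i
        # S_i, centred at the current mean (assumption stated in the lead-in).
        s_i = sum((x - means[i]) ** 2 * r[i] for x, r in zip(xs, ratios)) / n_obs / w
        new_alphas.append((1.0 - eps) * alphas[i] + eps * a_i)
        new_means.append((1.0 - eps) * means[i] + eps * m_i)
        new_vars.append((1.0 - eps) * variances[i] + eps * s_i)
    return new_alphas, new_means, new_vars


if __name__ == "__main__":
    data = [-2.1, -1.9, -2.0, -1.8, 2.2, 1.9, 2.0, 2.1]  # made-up sample
    params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])       # made-up starting values
    for _ in range(20):
        params = relaxed_step(data, *params, eps=0.5)
    print(params)
```

The sketch keeps the three updates in one pass so that every quantity on the right-hand side is evaluated at the previous iterate, which is what the fixed-point form of the procedure requires.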

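In the next section, we show that if $\varepsilon$ is a sufficiently small positive
number, then, with probability approaching 1 as the sample size N approaches
infinity, this procedure converges locally to the consistent maximum-likelihood
estimate. This is done by showing that, with probability approaching 1 as
N approaches infinity, the operator $\Phi_\varepsilon$ is locally contractive (in a suitable
vector norm) near that estimate, provided $\varepsilon$ is a sufficiently small positive
number. In saying that $\Phi_\varepsilon$ is locally contractive near a point
$(\alpha,\mu,\Sigma)^T \in \mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$, we mean that there is a vector norm $\|\cdot\|$ on
$\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$ and a number $\lambda$, $0 \le \lambda < 1$, such that

$$\left\| \Phi_\varepsilon(\alpha',\mu',\Sigma') - \begin{pmatrix} \alpha \\ \mu \\ \Sigma \end{pmatrix} \right\|
  \le \lambda \left\| \begin{pmatrix} \alpha' \\ \mu' \\ \Sigma' \end{pmatrix} - \begin{pmatrix} \alpha \\ \mu \\ \Sigma \end{pmatrix} \right\| \tag{5}$$

whenever $(\alpha',\mu',\Sigma')^T$ lies sufficiently near $(\alpha,\mu,\Sigma)^T$.
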

3. The local contractibility and convergence results. 


We now establish the following 



Theorem. With probability approaching 1 as N approaches infinity, $\Phi_\varepsilon$
is a locally contractive operator (in some norm on $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$) near the
consistent maximum-likelihood estimate whenever $\varepsilon$ is a sufficiently small
positive number.

Our main result is an immediate consequence of this theorem, which we 
state as a 


Corollary. With probability approaching 1 as N approaches infinity, the
iterative procedure (4) converges locally to the consistent maximum-likelihood
estimate whenever $\varepsilon$ is a sufficiently small positive number.

Throughout the proof of the theorem, the symbol "$\nabla$" denotes the Frechet
derivative of a vector-valued function of a vector variable. When ambiguity
exists, the specific vector variable of differentiation appears as a subscript
of this symbol. For questions concerning the definition and properties of
Frechet derivatives, see Luenberger [9].


Proof of the theorem: 


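Let $(\alpha,\mu,\Sigma)^T$ be the consistent maximum-likelihood estimate.
We assume that $\alpha_i \ne 0$, i = 1,...,m. (As N tends to infinity, the probability
approaches 1 that this is the case.) It must be shown that, with probability
approaching 1 as N approaches infinity, an inequality of the form (5)
holds whenever $\varepsilon$ is a sufficiently small positive number.

For any norm $\|\cdot\|$ on $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$, one can write

$$\Phi_\varepsilon(\alpha',\mu',\Sigma') - \begin{pmatrix} \alpha \\ \mu \\ \Sigma \end{pmatrix}
  = \nabla\Phi_\varepsilon(\alpha,\mu,\Sigma)
    \left[ \begin{pmatrix} \alpha' \\ \mu' \\ \Sigma' \end{pmatrix} - \begin{pmatrix} \alpha \\ \mu \\ \Sigma \end{pmatrix} \right]
  + O\!\left( \left\| \begin{pmatrix} \alpha' \\ \mu' \\ \Sigma' \end{pmatrix} - \begin{pmatrix} \alpha \\ \mu \\ \Sigma \end{pmatrix} \right\|^2 \right).$$
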
Consequently, the theorem will be proved if it can be shown that, for small
positive $\varepsilon$, $\nabla\Phi_\varepsilon(\alpha,\mu,\Sigma)$ converges in probability to an operator which has
norm less than 1 with respect to a suitable norm on $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$.
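
To spell out why a bound on the derivative suffices, the following worked estimate may help. It is a sketch under stated assumptions: u abbreviates the estimate $(\alpha,\mu,\Sigma)^T$, u' a nearby point, $\lambda_0 < 1$ a bound on the operator norm of $\nabla\Phi_\varepsilon(u)$ (available with probability approaching 1 once the limit operator has norm less than 1), and C a constant in the second-order remainder. None of these symbols appears in the original text.

```latex
% u solves the likelihood equations, so \Phi_\varepsilon(u) = u.
\begin{align*}
\|\Phi_\varepsilon(u') - u\|
  &= \|\Phi_\varepsilon(u') - \Phi_\varepsilon(u)\| \\
  &\le \|\nabla\Phi_\varepsilon(u)\|\,\|u' - u\| + C\,\|u' - u\|^{2} \\
  &\le \bigl(\lambda_0 + C\,\|u' - u\|\bigr)\,\|u' - u\| \\
  &\le \lambda\,\|u' - u\|
  \qquad\text{whenever } \|u' - u\| \le (\lambda - \lambda_0)/C ,
\end{align*}
% for any fixed \lambda with \lambda_0 < \lambda < 1.
```

This is an inequality of the form required for local contractivity, with the neighborhood of u determined by the choice of $\lambda$.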

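One can write $\nabla\Phi_\varepsilon$ as $(1-\varepsilon)I$ plus $\varepsilon$ times a matrix of Frechet derivatives:

$$\nabla\Phi_\varepsilon = (1-\varepsilon) I + \varepsilon
  \begin{pmatrix}
    \nabla_\alpha A & \nabla_\mu A & \nabla_\Sigma A \\
    \nabla_\alpha M & \nabla_\mu M & \nabla_\Sigma M \\
    \nabla_\alpha S & \nabla_\mu S & \nabla_\Sigma S
  \end{pmatrix}.$$

This is consistent with our representation of elements of $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$ as
columns $(\alpha,\mu,\Sigma)^T$.
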

The entries of the above matrix can themselves be represented as matrices 
of Frechet derivatives. For i = l,...,m, we introduce inner products 
^x,y>^ » x'^(a^2j^^)y on and <A,B>'^ = tr{A(-^ on the space cf 

real, symmetric n×n matrices. After a straightforward but extremely
tedious calculation, one obtains with the aid of equations (1) that


N 


7-A(a,u,E) = I - (diag a^){| 


,1 N 

7^A(a,y,E) « -(diag 


('Pi<V \ 


P(-V 

• 

P (Xj^) ' 

9 

1 1 

# 

4 

\ / 

1 P'V/ 

‘■i<V \ 

MV 

p(V 

4 

P<V 

4 

, '■m'V 

L Pm< V 

\ P(*k> / 

\ P<V 


}' 


. 1 i uu I ' 1 1 
.(dlag a^) '


_,i )/v -|i - l.«> 


Vj^H(a,U,E) - -t-J- Ji 




^ (X.-M) 
p(Xk> 


Pl<V 

p(Xfc) 


V P(*k> 


v_H(“.p.^) “ I ■■’ lil 






VjM(H,U,I) - (diag ^ Ji - 


f— E 
k=l 




, . ., s|\ <r"^Vv -n Ux. ~u - I.* 

p(x^^) ^V^m7 \ P(Xk^ m m ^ m 


11 


VjjS<a,y»Z) « -<diag 


VpS(a,vi,S) 


•(diag j^Sj^ 


/ p(J)t=i"(V'‘i>(V''i'-- 


/ \ 
I p'V 

0 

P^CV -r 

-Iiy 


t 

1 '’m'V 
[ p^V/ 

/Pl^V -1 T \ 

1 p(Xj^)^^l ~ 

/Pi'v, 

p(Xj,)'V 




Pm<V ■ „ ,, 

/ \ n 


1 ? Pi<-V ,.-i 


VjSCa.M.J) . (dlag 




(dlag 2^){y 




P(V 


^- 1 , 


-. " 

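The inner products $\langle\cdot,\cdot\rangle_i$, together with scalar multiplication
on $\mathbb{R}^1$, induce an inner product $\langle\cdot,\cdot\rangle$ on $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$. Setting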
V(x)





■W 




\ 

\ 


Pi -1 T 

Too 

i 


one obtains 


1 0 


V(p (ct,p,2) 


•1 N _T T "\ 

0 I e(diag^ 

•1 N Pj Ci^i.) _i T' 

0 0 (i-t)i + £(diags, ^ii-r7;rTl2i 


'1 k=l p(Xj^) 




-e 


I (dlag a^) 0 ° \ 

0 10 

^ 0 0 (dlag 2^) 
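We have assumed that the solution $(\alpha,\mu,\Sigma)^T$ of the likelihood equations is
consistent. Denoting the true parameters by $(\alpha^o,\mu^o,\Sigma^o)^T$, one verifies without
difficulty that $\nabla\Phi_\varepsilon(\alpha,\mu,\Sigma)$ converges in probability to $E(\nabla\Phi_\varepsilon(\alpha^o,\mu^o,\Sigma^o))$ as
N approaches infinity. A straightforward calculation yields

$$E(\nabla\Phi_\varepsilon(\alpha^o,\mu^o,\Sigma^o)) =
  \begin{pmatrix} I & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & I \end{pmatrix}
  - \varepsilon
  \begin{pmatrix} (\operatorname{diag}\, \alpha_i^o) & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & (\operatorname{diag}\, \Sigma_i^o) \end{pmatrix}
  \left\{ \int_{\mathbb{R}^n} V^o(x)\, \langle V^o(x), \cdot \rangle\, p^o(x)\, dx \right\}.$$

(In this expression, the superscript "o" on V and p indicates that the
true parameters are used in evaluating these functions.) Thus
$E(\nabla\Phi_\varepsilon(\alpha^o,\mu^o,\Sigma^o))$ is an operator on $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$ of the form $I - \varepsilon QR$, where
Q and R are positive-definite and symmetric with respect to the inner product
$\langle\cdot,\cdot\rangle$. Since QR is positive-definite and symmetric with respect to the
inner product $\langle\cdot, Q^{-1}\cdot\rangle$ on $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$, it must be the case that, for small
positive $\varepsilon$, the operator norm of $E(\nabla\Phi_\varepsilon(\alpha^o,\mu^o,\Sigma^o))$, with respect to the
inner product $\langle\cdot, Q^{-1}\cdot\rangle$, is less than 1. So, for small positive $\varepsilon$,
$\nabla\Phi_\varepsilon(\alpha,\mu,\Sigma)$ converges in probability to an operator having norm less than 1
with respect to the inner product $\langle\cdot, Q^{-1}\cdot\rangle$ on $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$. This completes the
proof of the theorem.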
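We remark that, in order for the conclusion of the theorem to hold, it is
sufficient to take $\varepsilon$ less than $4/[m(n+1)(n+2)]$. Indeed, it is observed in the
proof of the theorem that $E(\nabla\Phi_\varepsilon(\alpha^o,\mu^o,\Sigma^o)) = I - \varepsilon QR$, where QR is positive-
definite and symmetric with respect to a certain inner product and, hence, has
positive eigenvalues. Denoting the spectral radius of QR by $\rho(QR)$, one then
verifies that $E(\nabla\Phi_\varepsilon(\alpha^o,\mu^o,\Sigma^o))$ has operator norm less than 1, with respect
to some vector norm, whenever $\varepsilon$ is less than $2/\rho(QR)$. Now

$$\begin{aligned}
\rho(QR) < \operatorname{tr}\{QR\}
  &= \sum_{i=1}^{m} \alpha_i \int_{\mathbb{R}^n} \frac{p_i(x)^2}{p(x)}\, dx
   + \sum_{i=1}^{m} \alpha_i \int_{\mathbb{R}^n} \frac{p_i(x)^2}{p(x)}\, (x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\, dx \\
  &\quad + \sum_{i=1}^{m} \frac{\alpha_i}{2} \int_{\mathbb{R}^n} \frac{p_i(x)^2}{p(x)}\,
     \operatorname{tr}\{(\Sigma_i^{-1}(x-\mu_i)(x-\mu_i)^T - I)^2\}\, dx \\
  &\le \sum_{i=1}^{m} \int_{\mathbb{R}^n} p_i(x)\, dx
   + \sum_{i=1}^{m} \int_{\mathbb{R}^n} (x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\, p_i(x)\, dx \\
  &\quad + \frac{1}{2} \sum_{i=1}^{m} \int_{\mathbb{R}^n}
     \operatorname{tr}\{(\Sigma_i^{-1}(x-\mu_i)(x-\mu_i)^T - I)^2\}\, p_i(x)\, dx \\
  &= m + mn + \frac{m}{2}(n^2 + n) = \frac{m(n+1)(n+2)}{2}.
\end{aligned}$$

It follows that the conclusion of the theorem holds whenever $\varepsilon < 4/[m(n+1)(n+2)]$.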
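
For concreteness, the bound can be evaluated for given m and n. The sketch below is illustrative only: the function names are hypothetical, and it assumes the criterion that $\varepsilon$ be less than 2 divided by the spectral radius of QR, combined with the trace bound m(n+1)(n+2)/2 on that spectral radius.

```python
from fractions import Fraction


def trace_bound(m, n):
    """Upper bound m(n+1)(n+2)/2 on tr{QR} for m components in n dimensions."""
    return Fraction(m * (n + 1) * (n + 2), 2)


def sufficient_step_size(m, n):
    """Step sizes below this value satisfy eps < 2 / tr_bound <= 2 / rho(QR)."""
    return 2 / trace_bound(m, n)


if __name__ == "__main__":
    for m, n in [(2, 1), (2, 2), (3, 2)]:
        print(m, n, trace_bound(m, n), sufficient_step_size(m, n))
```

Exact rational arithmetic is used so that the value is not blurred by rounding; the bound shrinks like 1/(m n^2), so it is conservative for large problems.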

4. Concluding remarks.

A number of numerical techniques for obtaining maximum-likelihood estimates
of the parameters for a mixture of normal distributions have been discussed in the
literature. In addition to the usual steepest-ascent method for obtaining a local
maximum of the log-likelihood function, we mention in particular Newton's method,
the method of scoring, and the modifications of these procedures investigated by
Kale [8] for obtaining solutions of the likelihood equations. It is our feeling
that the iterative procedure presented here offers considerable computational
advantages over these procedures in many cases of practical interest.

Although Newton's method and the method of scoring offer quadratic and
near-quadratic convergence, respectively, for large sample sizes, they require
at each iteration the inversion of a square matrix whose dimension is equal to
the number of independent variables among the parameters, namely
m(n+1)(n+2)/2 - 1. Thus these methods may be less efficient computationally
than the iterative procedure (4) if m and n are large, even though they may
yield a satisfactory approximate solution after fewer iterations. The modified
versions of Newton's method and the method of scoring do not require the
re-calculation of the inverse of a large matrix at each step. However,
quadratic convergence is not achieved with these modified methods, and
multiplication by a large matrix must still be carried out at each iteration.

Even though the partial derivatives of the log-likelihood function are not
appreciably more difficult to evaluate than the expressions used in defining
the function $\Phi_\varepsilon$, the procedure (4) appears to have two particular advantages
over the steepest-ascent method. First, the successive iterates defined by
(4) automatically satisfy the requisite constraints on the parameters, i.e.,
the successive $\Sigma_i$'s are, in probability, positive-definite and the successive
$\alpha_i$'s are positive and sum to 1. Second, by the remarks following the proof
of the theorem, one knows that, in probability, there is a value of $\varepsilon$, depending
only on m and n, for which the procedure (4) converges locally to the
consistent maximum-likelihood estimate. We doubt that there exists a step-size
depending only on m and n which is similarly sufficient for the local
convergence of the steepest-ascent procedure.
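
The claim that the mixing proportions keep summing to 1 can be checked in one line. The following worked identity is a sketch that assumes the current proportions already sum to 1, that the step-size lies in (0, 1], and that the updated proportion is the relaxed combination of the old proportion and the right-hand side of (1.a); the superscript "+" for the updated value is notation introduced here.

```latex
\begin{align*}
\sum_{i=1}^{m} \frac{\alpha_i}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)}
  &= \frac{1}{N} \sum_{k=1}^{N} \frac{\sum_{i=1}^{m} \alpha_i\, p_i(x_k)}{p(x_k)}
   = \frac{1}{N} \sum_{k=1}^{N} 1 = 1, \\
\sum_{i=1}^{m} \alpha_i^{+}
  &= (1-\varepsilon) \sum_{i=1}^{m} \alpha_i
   + \varepsilon \sum_{i=1}^{m} \frac{\alpha_i}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)}
   = (1-\varepsilon) + \varepsilon = 1 .
\end{align*}
```

Positivity follows in the same way, since each updated proportion is a convex combination of two positive numbers.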


Appendix 

We now give a brief proof of the existence and uniqueness of the consistent
maximum-likelihood estimate. For the sake of generality, this is done in a
somewhat broader context than is necessary for this paper.

Let $p(x,\theta)$ be a probability density function of a vector variable $x \in \mathbb{R}^n$
and a vector parameter $\theta \in \mathbb{R}^\nu$. If $\{x_k\}_{k=1,\dots,N}$ is an independent sample of
observations on a random variable $x \in \mathbb{R}^n$ whose probability density function is
$p(x,\theta^o)$ for some $\theta^o \in \mathbb{R}^\nu$, then a maximum-likelihood estimate of $\theta^o$ is a
choice of $\theta$ which locally maximizes the log-likelihood function

$$L = \sum_{k=1}^{N} \log p(x_k,\theta).$$

If p is a differentiable function of $\theta$, then a necessary condition for a
maximum-likelihood estimate is that the likelihood equations

$$\frac{\partial L}{\partial \theta_i} = 0, \qquad i = 1,\dots,\nu,$$

be satisfied, where $\theta_i$ is the i-th component of $\theta$. In the following, our
objective is to show that if p satisfies certain conditions, then, given any
sufficiently small neighborhood of $\theta^o$, there is, with probability approaching
1 as N approaches infinity, a unique solution of the likelihood equations in
that neighborhood, and this solution is a maximum-likelihood estimate of $\theta^o$.

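We assume that $p(x,\theta)$ satisfies the following conditions of Chanda [1]:
(a) There is a neighborhood $\Omega$ of $\theta^o$ such that for all $\theta \in \Omega$, for almost
all x, and for i,j,k = 1,...,$\nu$, the derivatives

$$\frac{\partial p}{\partial \theta_i}, \qquad
  \frac{\partial^2 p}{\partial \theta_i\, \partial \theta_j}, \qquad
  \frac{\partial^3 p}{\partial \theta_i\, \partial \theta_j\, \partial \theta_k}$$

exist and satisfy

$$\left| \frac{\partial p}{\partial \theta_i} \right| \le f_i(x), \qquad
  \left| \frac{\partial^2 p}{\partial \theta_i\, \partial \theta_j} \right| \le f_{ij}(x), \qquad
  \left| \frac{\partial^3 \log p}{\partial \theta_i\, \partial \theta_j\, \partial \theta_k} \right| \le f_{ijk}(x),$$

where $f_i$ and $f_{ij}$ are integrable and $f_{ijk}$ satisfies

$$\int_{\mathbb{R}^n} f_{ijk}(x)\, p(x,\theta^o)\, dx < \infty.$$

(b) The matrix

$$J(\theta) = \left( \int_{\mathbb{R}^n} \frac{\partial \log p}{\partial \theta_i}\,
  \frac{\partial \log p}{\partial \theta_j}\; p\; dx \right)$$

is positive-definite at $\theta^o$.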


Let

$$\mathcal{L}(\theta) = \begin{pmatrix} \dfrac{1}{N} \dfrac{\partial L}{\partial \theta_1} \\ \vdots \\ \dfrac{1}{N} \dfrac{\partial L}{\partial \theta_\nu} \end{pmatrix}.$$

It is immediately seen that $\mathcal{L}(\theta) = 0$ if and only if the likelihood equations
are satisfied, and that, by the weak law of large numbers, $\mathcal{L}(\theta^o)$ converges in
probability to zero. Furthermore, it follows from assumptions (a) and (b)
above that there exists a neighborhood $\Omega^o$ of $\theta^o$ (contained in $\Omega$ and, for
convenience, convex) and a positive $\varepsilon$ such that, with probability approaching
1 as N approaches infinity, $\nabla\mathcal{L}(\theta) \le -\varepsilon I$ for all $\theta \in \Omega^o$. (The inequality
is with respect to the usual ordering on symmetric matrices.) Denoting the
spherical neighborhood of radius $\delta$ about $\theta^o$ by $\Omega_\delta$, we establish the following

Lemma: With probability approaching 1 as N approaches infinity,
(i) $\mathcal{L}$ is one-to-one on $\Omega^o$,
(ii) $\mathcal{L}(\Omega_\delta)$ contains the ball of radius $\varepsilon\delta$ about $\mathcal{L}(\theta^o)$ whenever
$\Omega_\delta \subset \Omega^o$.

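Proof: We may assume that $\nabla\mathcal{L}(\theta) \le -\varepsilon I$ for all $\theta \in \Omega^o$, since the probability
that this is the case tends to 1 as N approaches infinity. To prove (i),
suppose that $\mathcal{L}(\theta^1) = \mathcal{L}(\theta^2)$ for $\theta^1$ and $\theta^2$ in $\Omega^o$. Then

$$0 = (\theta^1 - \theta^2)^T \left[ \mathcal{L}(\theta^1) - \mathcal{L}(\theta^2) \right]
    = (\theta^1 - \theta^2)^T \left\{ \int_0^1 \nabla\mathcal{L}(\theta^2 + t[\theta^1 - \theta^2])\, dt \right\} (\theta^1 - \theta^2).$$

The negative-definiteness of $\nabla\mathcal{L}$ implies that $\theta^1 = \theta^2$, and (i) is proved.

To prove (ii), suppose that $\Omega_\delta \subset \Omega^o$, and let $\theta^1$ be a boundary point
of $\Omega_\delta$. Then

$$\mathcal{L}(\theta^1) - \mathcal{L}(\theta^o)
  = \left\{ \int_0^1 \nabla\mathcal{L}(\theta^o + t[\theta^1 - \theta^o])\, dt \right\} (\theta^1 - \theta^o).$$

After left-multiplying this equation by $(\theta^1 - \theta^o)^T$, one verifies using Schwarz's
inequality and the negative-definiteness of $\nabla\mathcal{L}$ that

$$\| \mathcal{L}(\theta^1) - \mathcal{L}(\theta^o) \| \ge \varepsilon\, \| \theta^1 - \theta^o \| = \varepsilon\delta,$$

where $\|\cdot\|$ denotes the usual Euclidean norm on $\mathbb{R}^\nu$. Since all boundary points
of $\mathcal{L}(\Omega_\delta)$ are images under $\mathcal{L}$ of boundary points of $\Omega_\delta$, the proof of (ii)
is complete.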
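
The step that combines Schwarz's inequality with negative-definiteness can be written out as follows. This is a sketch of the intermediate estimate only; it assumes that the derivative of the normalized score map is bounded above by $-\varepsilon I$ along the segment joining the two points, and the abbreviations d and g are introduced here for brevity.

```latex
% d = \theta^{1} - \theta^{o}, \quad g = \mathcal{L}(\theta^{1}) - \mathcal{L}(\theta^{o})
\begin{align*}
d^{T} g
  &= d^{T} \Bigl\{ \int_0^1 \nabla\mathcal{L}(\theta^{o} + t\,d)\, dt \Bigr\}\, d
   \;\le\; -\varepsilon\, \|d\|^{2}, \\
\varepsilon\, \|d\|^{2}
  &\le |d^{T} g| \;\le\; \|d\|\, \|g\|
  \quad\Longrightarrow\quad
  \|g\| \;\ge\; \varepsilon\, \|d\| = \varepsilon\delta .
\end{align*}
```

The last equality uses that a boundary point of the spherical neighborhood lies at distance exactly $\delta$ from its center.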

The desired result of this appendix follows immediately from this lemma and
the remarks preceding it. Indeed, if $\Omega^1$ is any neighborhood of $\theta^o$ which is
contained in $\Omega^o$, then one can find a $\delta$ for which $\Omega_\delta \subset \Omega^1$. By the lemma,
the probability approaches 1 as N tends to infinity that $\mathcal{L}$ is one-to-one
on $\Omega^1$ and that $\mathcal{L}(\Omega_\delta)$ and, hence, $\mathcal{L}(\Omega^1)$ contain the ball of radius $\varepsilon\delta$
about $\mathcal{L}(\theta^o)$. Since $\mathcal{L}(\theta^o)$ converges in probability to zero, one concludes
that, with probability tending to 1 as N approaches infinity, there exists
a unique $\theta \in \Omega^1$ for which $\mathcal{L}(\theta) = 0$. Since the probability also tends to 1
that $\nabla\mathcal{L}$ is negative-definite on $\Omega^1$, this $\theta$ is, with probability approaching
1, a maximum-likelihood estimate.